Deep Learning Math  ·  The Lab

Reverse-Engineer a Mind

an introduction to mechanistic interpretability

In the last lab you trained a tiny model. It works — but it's a black box of numbers. This lab opens it up. We'll recover the actual algorithm hiding in the weights, in plain words. That's mechanistic interpretability.

model auto-trained on load · {{ trainedNote }}
The premise

A trained network isn't magic. It's an algorithm written in numbers.

Nobody programmed our model's rules by hand — training discovered them and stored them in a grid of weights. Mechanistic interpretability is the craft of reading those weights back out as something a human can understand: features, rules, circuits. Not “the model is 91% accurate,” but “here is exactly what it computes, and why.”

Real models are dauntingly large, so the field starts where we can win: the smallest model that still does something. Ours is the simplest possible — a single matrix — which means we can decode it completely. The perfect first patient.

Read its mind — one row is one belief.

Pick a letter. The model's entire opinion about “what comes next” is a single row of the weight matrix, turned into probabilities. We don't have to run the model — we can just read it off the page. And look: it matches what the data actually does. That's a learned feature, caught in the open.

after this letter…
…the model expects (violet = its belief · tick = the real data)
{{ b.ch }}
{{ b.compare }}
TOOL 02 causal intervention

Intervene — break a part and watch what dies.

Looking isn't proof. To show a weight causes a behavior, you delete it and see what breaks — an ablation. Knock out the column that lets words end and the model never shuts up. The dark stripe in the matrix is the damage; the words below are the symptom.

the weights now
pick a lesion
active lesion
{{ abLabel }}
{{ s }}
TOOL 03 the extracted program

Extract the algorithm it wrote for itself.

Rank every row by its most confident prediction and you get the model's rulebook, in English. This is the whole point of interpretability: turning a wall of numbers into statements you can check.

its most confident rules
after {{ r.from }}{{ r.to }} {{ r.pct }}
always-greedy path

If it always took its single most likely letter, never gambling, it would write exactly one word, forever:

{{ greedyWord }}

That single deterministic trace is the model's backbone — temperature and sampling just let it wander off this path.

Scaling the idea up

Our matrix was a toy. The real circuits are the same idea, harder to see.

Everything you just did — reading weights, ablating, extracting rules — is exactly how researchers study GPT-scale models. Only the objects get richer. Here's the vocabulary, mapped to math you already own.

Features are directions

A concept — “this is code,” “this is in French” — is a direction in the model's activation vectors. To test for it, you take a dot product (Chapter 4). Interpretability is largely the hunt for meaningful directions.

The residual stream

Every layer reads from and writes to one shared vector per token — a running notebook. Each matrix (Chapter 5) edits it a little. The “logit lens” you used is just reading that notebook partway through.

Induction heads — the famous circuit

The first circuit ever fully reverse-engineered: an attention head that finds the previous place the current token appeared and copies whatever came next. It's how a model that saw “X Y” earlier, on meeting “X” again, predicts “Y” — copy what followed last time. Pure dot-product matching (Ch.4) plus softmax (Ch.6) — the attention you met in Chapter 7.

Superposition — why it's hard

Models cram far more features than they have dimensions, so a single neuron lights up for many unrelated things. Our one clean matrix was a gift; real weights are tangled, which is why tools like sparse autoencoders exist to pull the features apart.

You just opened a model and explained what it does. That's the whole job.

The same three moves — read the weights, intervene to prove cause, extract the rule — scale all the way up to making frontier models honest and safe. The toy here is small; the method is the real thing.